Int. J. Mol. Sci. 2013, 14, 11444-11483; doi:10.3390/ijms140611444 



OPEN ACCESS 



International Journal of 

Molecular Sciences 

ISSN 1422-0067 

www.mdpi.com/journal/ijms 

Review 

Silicon Era of Carbon-Based Life: Application of Genomics and 
Bioinformatics in Crop Stress Research 

Man-Wah Li, Xinpeng Qi, Meng Ni and Hon-Ming Lam * 

Center for Soybean Research, State Key Laboratory of Agrobiotechnology and School of Life Sciences, 
the Chinese University of Hong Kong, Shatin, N.T., Hong Kong; 
E-Mails: limanwah@cuhk.edu.hk (M.-W.L.); qixinpeng@cuhk.edu.hk (X.Q.); 
nimeng@cuhk.edu.hk (M.N.) 

* Author to whom correspondence should be addressed; E-Mail: honming@cuhk.edu.hk; 
Tel.: +852-3943-6336; Fax: +852-2603-5646. 

Received: 31 January 2013; in revised form: 7 May 2013 / Accepted: 17 May 2013 / 
Published: 29 May 2013 

Abstract: Abiotic and biotic stresses lead to massive reprogramming of different life 
processes and are the major limiting factors hampering crop productivity. Omics-based 
research platforms allow for a holistic and comprehensive survey on crop stress responses 
and hence may bring forth better crop improvement strategies. Since high-throughput 
approaches generate considerable amounts of data, bioinformatics tools will play an 
essential role in storing, retrieving, sharing, processing, and analyzing them. Genomic and 
functional genomic studies in crops still lag far behind similar studies in humans and other 
animals. In this review, we summarize some useful genomics and bioinformatics resources 
available to crop scientists. In addition, we also discuss the major challenges and 
advancements in the "-omics" studies, with an emphasis on their possible impacts on crop 
stress research and crop improvement. 

1. Introduction 

According to the Food and Agricultural Organization of the United Nations (FAO), food production 
must be increased by 70% in the next 40 years to meet the increasing global demand [1]. Abiotic and 
biotic stresses are major limiting factors hampering crop productivity. Therefore, understanding the 
stress responses of crops using genomic information is important in bringing forth more effective crop 
improvement strategies. 

The publishing of the Arabidopsis thaliana genome in 2000 is a cornerstone of the plant genomics 
era [2]. Taking advantage of the high-throughput data acquisition platforms of the next generation 
sequencing technology, additional crop genomes have been subsequently decoded. So far, the draft 
genomes of more than 40 plants have been completed, including those processed in the 1000 Plant and 
Animal Project [3]. Other "-omics" technologies such as transcriptomics, proteomics, metabolomics, 
and phenomics (Figure 1) have also undergone rapid development in recent years. Together, there is a 
large volume of accumulated data, and hence data management and data mining have become a 
bottleneck for "-omics" research. 



Figure 1. Infusion of biological "-omics" with bioinformatics. 

To convert the great amount of data into manageable information, it is essential to establish 
standard formats and methods for storing, retrieving, and sharing data. Algorithms based on 
mathematical and statistical models are needed to handle biological data. This review aims to provide a 
systematic summary of the currently available databases and bioinformatics resources and highlight 
some challenges and advancements in the study of genomics and other "-omics", with emphasis on 
their implications on crop stress research. 



2. General Bioinformatics Resources 

2.1. Databases 



Various databases have been developed to accommodate the comprehensive -omics data and some 
of them also provide onsite analytical tools (Table 1). The three commonly used sequence databases 
are GenBank in the USA, the European Nucleotide Archive (ENA) in Europe, and the DNA Data Bank of Japan 
(DDBJ). They are jointly maintained under the International Nucleotide Sequence Database 
Collaboration (INSDC), and the deposited data are frequently synchronized. There are also repositories designated 
specifically for plants, such as Phytozome that holds the genomic information of more than 40 plant 
species, including all the sequenced crops. Besides basic genomic information, databases such as 
Legume Information System (LIS) facilitate synteny analyses and comparative genomic studies 
between closely related crop plants. 



Table 1. Example of some commonly used databases. 



Database name   URL                                         Reference
GenBank         http://www.ncbi.nlm.nih.gov/genbank/        [4]
ENA             http://www.ebi.ac.uk/ena/                   [5]
DDBJ            http://www.ddbj.nig.ac.jp/                  [6]
Phytozome       http://www.phytozome.net/                   [7]
Gramene         http://www.gramene.org/                     [8]
KEGG            http://www.genome.jp/kegg/                  [9]
PlantGDB        http://www.plantgdb.org/                    [10]
EnsemblPlants   http://plants.ensembl.org/index.html        [11]
VISTA           http://genome.lbl.gov/vista/index.shtml     [12]
PLAZA           http://bioinformatics.psb.ugent.be/plaza/   [13]
GigaDB          http://gigadb.org/                          [14]
SGN             http://solgenomics.net/                     [15]
GrainGenes      http://wheat.pw.usda.gov                    [16]
LIS             http://www.comparative-legumes.org/         [17]



Online resources for individual crops, together with massive datasets, have been developed 
(Table 2). They provide systematically integrated information, including genetic resources (genetic maps, 
molecular markers, and quantitative trait loci (QTL)); genomic resources (DNA sequences, gene 
models, and regulatory elements); gene expression data (ESTs, cDNA sequences, and transcriptomes); 
and functional units (proteomic and metabolomic data). Crops of higher economic value 
are usually accompanied by more comprehensive databases. The genomic sequences of some 
economically less important crops, such as foxtail millet, sorghum, and barley, have been released 
recently [18-20], and their corresponding integrated databases are still under development. 

Some data repositories also provide information related to abiotic and biotic stress responses. For 
example, in MaizeGDB, there are well documented records for tropical maize exhibiting tolerance to 
drought stress [21]. In SoyBase, genetic markers associated with salt tolerance, drought tolerance, and 
cyst nematode resistance are incorporated with genomic and expression information. Databases for 
individual crops could also facilitate the unveiling of the genetic basis of specific traits. For example, 
the tomato genome sequence helped identify the R-genes which were then incorporated in the Plant 
Resistance Genes database [22]. 






Table 2. Data repositories for crop plants. 



Crop       Database name                             URL of related database                              Reference
Rice       RAP-DB                                    http://rapdb.dna.affrc.go.jp/                        [23]
Maize      MaizeGDB                                  http://www.maizegdb.org/                             [24]
Medicago   Medicago truncatula Sequencing Resource   http://www.medicago.org/genome/indexold.php
Wheat      GrainGenes                                http://wheat.pw.usda.gov/                            [16]
Potato     Solanaceae Genomics Resource              http://solanaceae.plantbiology.msu.edu/index.shtml
Soybean    SoyBase                                   http://soybase.org                                   [25]
Tomato     Tomato Functional Genomics Database       http://ted.bti.cornell.edu/                          [26]



2.2. Biological Ontologies Related to Crop Stress Research 



The standardization of ontology is important for the structuring of huge datasets, interconnection 
between databases, merging resources, and curation of information. Each ontology term has its own 
name, identifier/ID/accession number and definition. The identifier/ID/accession number is usually 
made up of a prefix and a number. For example, the Gene Ontology term "lipid binding" has the 
accession number GO:0008289. The definition of "lipid binding" is a gene product that can interact 
selectively and non-covalently with a lipid. 
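
The prefix-plus-number convention described above can be illustrated with a minimal Python sketch. The regular expression and the function name parse_accession are assumptions made for this illustration; they are not part of any ontology toolkit, and real accessions may follow stricter rules than the pattern used here.

```python
import re

# Assumed pattern for illustration: an alphabetic prefix (underscores allowed,
# as in GR_tax), a colon, and a run of digits.
ACCESSION_RE = re.compile(r"^([A-Za-z_]+):(\d+)$")


def parse_accession(accession):
    """Split an ontology accession such as 'GO:0008289' into (prefix, number)."""
    match = ACCESSION_RE.match(accession.strip())
    if match is None:
        raise ValueError(f"not a recognised ontology accession: {accession!r}")
    # The number is kept as a string so that leading zeros are preserved.
    return match.group(1), match.group(2)


prefix, number = parse_accession("GO:0008289")
print(prefix, number)
```

Keeping the numeric part as a string preserves the leading zeros, which are part of the accession.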

The Gene Ontology (GO) project provides a well-established and controlled vocabulary database 
for describing the function of a gene and its gene product. The ontology covers three aspects, including 
cellular component, molecular function, and biological process. GO is used in genome annotation to 
provide information on gene products. An evidence code (defined by the Evidence Code Ontology) is used to 
describe the evidence that links the GO annotation with the gene product. The evidence code 
indicates whether an annotation has been made manually by a curator or by automated electronic 
annotation. For example, EXP refers to "inferred from experiment"; IBA refers to "inferred from 
biological aspect of ancestor"; and IEA refers to "inferred from electronic annotation". All this 
information can be found on the Gene Ontology website [27]. 

Plant Trait Ontology (TO) is a controlled vocabulary for describing the plant trait and phenotype. In 
addition to anatomical and morphological traits, TO also includes a subset of controlled vocabularies 
for abiotic and biotic stress traits. For example, the yellow dwarf disease resistance (TO:0000292) is 
the child term of resistance to disease by mycoplasma-like organism (TO:0000013) under the lineage 
of stress trait (TO:0000164). 
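
The parent-child lineage in the example above can be followed programmatically. The sketch below is a toy illustration only: the lookup table contains just the three terms named in the text, and treating each link as a direct parent is an assumption, since the actual ontology may place intermediate terms between them.

```python
# Toy is_a table built only from the example in the text. Direct parent links
# are an assumption; the real ontology may contain intermediate terms.
IS_A = {
    "TO:0000292": "TO:0000013",  # yellow dwarf disease resistance
    "TO:0000013": "TO:0000164",  # resistance to disease by mycoplasma-like organism
    "TO:0000164": None,          # stress trait (top of this toy lineage)
}


def lineage(term, is_a=IS_A):
    """Return the term followed by its ancestors, walking child-to-parent links."""
    path = []
    seen = set()
    while term is not None and term not in seen:
        path.append(term)
        seen.add(term)  # guards against accidental cycles in the table
        term = is_a.get(term)
    return path


print(" -> ".join(lineage("TO:0000292")))
```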

There are many other biological ontology projects for different research fields. Ontologies listed in 
Table 3 contain the information related to crop stress responses. 

Table 3. List of ontologies containing information related to crop stress responses. 

Domain                                      Prefix   Description                                       Reference/website
Plant Environmental Conditions              EO       Controlled vocabulary for the representation      http://www.gramene.org/db/ontology/search?id=EO:0007359
                                                     of plant environmental conditions
Gene Ontology                               GO       Controlled vocabulary for genes and gene          [28]
                                                     products
Taxonomy Ontology                           GR_tax   Representation of the taxonomic tree of plants    http://www.gramene.org/db/ontology/search/id=GR_tax:U9U165
                                                     in the ontology format
The Plant-Associated Microbe Gene Ontology  PAMGO    Controlled vocabulary for the interaction of      [29]
                                                     microbes with their hosts
Plant Ontology                              PO       Controlled vocabulary for anatomy, morphology     [30]
                                                     and stages of development for all plants
Sequence Ontology                           SO       Controlled vocabulary for sequence annotations,   [31]
                                                     for the exchange of annotation data and for the
                                                     description of sequence objects in databases
Plant Trait Ontology                        TO       Controlled vocabulary for phenotypic traits in    http://www.gramene.org/db/ontology/search?id=TO:0000387
                                                     plants


3. Recent Advances and Challenges in Crop Genomics 

3.1. Polyploidy as a Major Challenge in Crop Genome Assembly 

Polyploidy is a major hindrance in crop genome assembly. One of the ways to tackle the highly 
polyploid genomes is to make references to the closely related, putative progenitor diploid genomes if 
they are available. The Catalogue of Life [32] and the Integrated Taxonomic Information System [33] 
may help to identify such related species. For example, the fiber-producing cotton (Gossypium hirsutum) 
is tetraploid, comprising an A-genome and a D-genome. To assist the assembly of the tetraploid 
genome, the diploid D-genome of G. raimondii was first sequenced and assembled [34]. A second 
example is strawberry (Fragaria x ananassa), with an estimated genome size of about 600 Mb. 
Although this is much smaller than other crop genomes, it is an octaploid (AAA'A'BBB'B') [35]. 
Therefore, the genome sequence of the woodland strawberry (Fragaria vesca), a potential progenitor 
of Fragaria x ananassa, was completed in 2012 to provide the first diploid model for the genomes of 
F. spp. [35,36]. Wheat is another example of polyploid crop genomes. The hexaploid bread wheat 
(Triticum aestivum) contains the A, B and D genomes, which probably originated from Triticum urartu 
(A genome), Aegilops tauschii (D genome), and an unknown species related to Aegilops speltoides 
(B genome). The genomic sequence information of T. aestivum, T. monococcum (a community 
standard line related to the A-genome donor), and Ae. tauschii, as well as the cDNA sequence 
information of T. aestivum and Ae. speltoides, were obtained [37]. With reference to the respective 
diploid genome information, over 90% of the wheat genes were successfully assembled into the A, B, or 
D genome with over 70% precision [37]. The drafted de novo genomes of T. urartu and Ae. tauschii 
were recently published, representing 94.3% and 97.0% of the predicted genome sizes respectively [38,39]. 
Although the lack of a good reference for the B genome is still an obstacle in building the T. aestivum 
genome, these pieces of work have built a good framework for the further whole genome assembly of 
bread wheat, and established a model for the study of other polyploid genomes. 

3.2. Reduced Genetic Diversity of Modern Crops 

Modern crops originated from a small number of plants. Bottleneck effects during domestication 
and prolonged human selection together have significantly reduced the genetic diversity of modern 
crops. Such a reduction in genetic diversity has been confirmed by several genomic studies 
(Supplementary Table S1). For example, whole-genome resequencing of 14 cultivated and 17 wild 
soybean genomes revealed that the wild soybeans have higher numbers of SNPs and genetic diversity 
compared to those of the cultivated ones [40]. The domesticated rice cultivars (Oryza sativa indica and 
Oryza sativa japonica) also show a lower genetic diversity than their wild relatives (O. rufipogon and 
O. nivara) in a study on 50 accessions of cultivated and wild rice [41]. More interestingly, even though 
both indica rice and japonica rice are cultivated, the japonica rice shows significantly lower genetic 
diversity than the indica rice, suggesting that the japonica rice has suffered from a stronger bottleneck 
effect under domestication [41]. On the other hand, although maize landraces and improved lines have 
retained a higher nucleotide diversity from their wild progenitor, as compared to other self-fertilizing 
crop species, a weak bottleneck effect can still be observed [42]. Reduced genomic diversity of major 
staple crops limits their adaptability to the changing environment and reduces the room for crop 
improvement. Therefore, crop improvement programs should turn their focus to the genetically 
compatible wild species, which have higher biodiversity and can serve as natural genetic reservoirs. 
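
Comparisons of genetic diversity such as those above are commonly summarized by nucleotide diversity (pi), the average number of pairwise differences per site among aligned sequences. The following Python sketch shows the calculation under simplifying assumptions: the sequences are short, gap-free, invented toy alignments, not data from the cited studies, and the function name is hypothetical.

```python
from itertools import combinations


def nucleotide_diversity(sequences):
    """Average pairwise differences per site among equal-length aligned sequences."""
    if len(sequences) < 2:
        raise ValueError("at least two sequences are required")
    length = len(sequences[0])
    if length == 0 or any(len(s) != length for s in sequences):
        raise ValueError("sequences must be non-empty and of equal length")
    total_differences = 0
    n_pairs = 0
    for a, b in combinations(sequences, 2):
        total_differences += sum(1 for x, y in zip(a, b) if x != y)
        n_pairs += 1
    return total_differences / (n_pairs * length)


# Invented toy alignments (not real accessions).
wild = ["ACGTACGT", "ACGTTCGT", "ACCTACGA", "TCGTACGT"]
cultivated = ["ACGTACGT", "ACGTACGT", "ACGTTCGT", "ACGTACGT"]

print(nucleotide_diversity(wild), nucleotide_diversity(cultivated))
```

In this toy case the wild set has the higher value, which is the pattern expected after a domestication bottleneck.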

3.3. Sequence and Structural Variations in Genomes Providing Clues for Stress Studies 

Sequence differences and structural variations in genomes are usually identified by comparing the 
genomes of wild species to their related landraces and modern cultivars, and also to other model 
plants. These differences can, on the one hand, provide information about genome evolution, and, on 
the other hand, serve as molecular markers for genetic mapping. Sequence differences and structural 
variations that affect gene structure, gene expression, and gene copy number are major determinants 
shaping the diversity among different varieties of the same species. For instance, wild soybeans and 
some rice accessions possess some present/absent variations or unmapped contigs that contain 
bona fide genes annotated to be involved in abiotic and biotic stresses [40,41,43,44]. One specific 
example relating to biotic stresses is the enrichment and over-representation of LRR (leucine-rich 
repeat) and NB-ARC (nucleotide-binding adaptor shared by APAF-1, certain R gene products and 
CED-4) domain-containing genes in some crop genomes [19,45]. In plant genomes, disease resistance 
(R) genes are responsible for defense responses [46]. LRR and NB-ARC are two important domains 
found on the R proteins [46]. The LRR domain-containing proteins play important roles in 
pathogen-host interactions and the activation of defense responses [47,48]. On the other hand, the NB-ARC 
domain is responsible for the multimerization and autoactivation of the R proteins upon stimulus [49]. The 
LRR and NB-ARC-containing genes exhibit higher ratios of nonsynonymous-to-synonymous SNPs 
than the genome average in crops such as soybean [40], rice [41,50], and sorghum [51]. In maize, 101 
out of 3490 large-effect SNPs detected are located on 49 LRR domain-containing genes [44]. LRR and 
NB-ARC domain-containing genes are important components in the plant defense response 
system [46,49,52] while the high nonsynonymous-to-synonymous SNP ratio of LRR or NB-ARC 
domain-containing genes suggests a dynamic evolution of these genes to combat pathogens. 
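
To make the nonsynonymous-to-synonymous SNP ratio concrete, the sketch below classifies SNPs within a coding sequence by comparing the amino acid encoded before and after each substitution, using the standard genetic code. The coding sequence and SNP positions are invented for illustration, and the function names are hypothetical. Note that this is a raw count ratio; it is not normalized by the number of synonymous and nonsynonymous sites, as a formal dN/dS estimate would be.

```python
# Standard genetic code, built in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def classify_snp(cds, position, alt_base):
    """Classify a substitution at a 0-based CDS position as (non)synonymous."""
    start = position - position % 3
    ref_codon = cds[start:start + 3]
    offset = position - start
    alt_codon = ref_codon[:offset] + alt_base + ref_codon[offset + 1:]
    same = CODON_TABLE[ref_codon] == CODON_TABLE[alt_codon]
    return "synonymous" if same else "nonsynonymous"


def nonsyn_to_syn_ratio(cds, snps):
    """Raw ratio of nonsynonymous to synonymous SNP counts."""
    labels = [classify_snp(cds, pos, alt) for pos, alt in snps]
    nonsyn = labels.count("nonsynonymous")
    syn = labels.count("synonymous")
    return nonsyn / syn if syn else float("inf")


cds = "ATGGCTCTTGAA"                     # invented: Met-Ala-Leu-Glu
snps = [(5, "C"), (6, "A"), (10, "T")]   # invented (position, alternative base)
print(nonsyn_to_syn_ratio(cds, snps))
```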

In addition to disease resistance genes, some transcription factors are found to be over-retained after 
the whole-genome duplication in Musa spp. (banana) [53]. Some of these transcription factors such as 
Myb, AP2/ERF, and WRKY are known to be important regulators in abiotic stress responses [54]. On 
the other hand, compared to rice, sorghum, and maize, there are more genes encoding for cytochrome 
P450, CCAAT-binding factor transcription factors, late-embryogenesis-abundant proteins, and 
osmoprotectant biosynthesis proteins in the Ae. tauschii genome (progenitor of the D genome of wheat) [38]. 
These genes are important for the adaptation to cold and physiological drought. Moreover, a 
significantly higher number of transmembrane ATPase subunits, which are probably involved in Na+ 
exclusion and mineral uptake, have been detected in Ae. tauschii than in wheat [37,38]. The extra 
genes in Ae. tauschii may be good candidates for wheat improvement. 

3.4. Advances in Ultra-High-Density Genetic Mapping Using SNPs 

Genetic mapping using genetic populations is one classical strategy to identify genes related to 
stress responses. Members in the mapping population can either be related (e.g., QTL mapping using 
bi-parental populations) or unrelated (e.g., genome-wide association study (GWAS) using germplasm 
collections) (for population structure, data characteristics and methods, see reviews [55,56]). There are 
some successful cases in identifying stress tolerance causal genes through mapping [57-59]. For 
example, a salt tolerance-conferring sodium transporter from rice was identified through QTL 
mapping [58]. The SKC1 locus corresponding to shoot K+ content was mapped with a BC2F2 
population generated from a cross between a salt-tolerant indica variety and a susceptible japonica 
variety [58]. The SKC1 locus was further confined to a 7.4-kb stretch by the BC3F4 progeny testing of 
fixed recombinant plants. The locus contains only a single open reading frame, which encodes an 
HKT-type transporter. SKC1 near-isogenic lines accumulated less Na+ under salt treatment compared 
to the susceptible parent. Voltage-clamp analysis also supports the notion that the SKC1 protein functions as a 
Na+-selective transporter that probably regulates K+/Na+ homeostasis under salt stress [58]. 

Classical molecular markers for mapping such as AFLP, RFLP, and SSR markers are sparsely 
distributed in the genome, and hence limit the mapping resolution and pose difficulties in pinpointing 
the phenotype-causal genes. With the availability of genomic sequence data, SNP markers become 
more accessible for use in mapping, to help achieve a much better resolution. However, conventional 
PCR-based methods are laborious and time-consuming while the resolution of array-based methods is 
limited by the number of probes on the array. 

High-resolution genotyping by whole-genome resequencing has been established [60,61], making 
the ultra-high-density genetic mapping more attainable. In principle, this method can achieve the 
highest resolution, provided that there are enough resources to capture all the SNPs in a population. 
In reality, polymorphic SNPs are usually captured by low-coverage sequencing (~1X for unrelated 
populations [58] and <0.1X for recombinant populations [60,62,63]). 

In a QTL study of recombinant inbred populations originating from indica and japonica rice, SNPs 
between the parental reference genomes were first identified using DiffSeq in the EMBOSS package 
and cleaned by SSAHASNP in the ssaha2 package. Low-depth sequencing reads of recombinant 
inbred lines (RILs) were mapped to the parents' pseudomolecules by using the SSAHA2 software [64] 
to determine the genotype of each RIL. SNPs were analyzed by a sliding window approach to 
determine the recombinant break points within the genome of every single line in the population to 
form a bin map [60]. This sliding window strategy can accommodate the high error rate of next 
generation sequencing and allow missing data resulting from low-coverage sequencing [60]. Each 
"bin" will serve as a "marker" in the subsequent linkage map construction using MAPMAKER/EXP 
and in QTL mapping using QTL Cartographer. In this study, using 150 rice RILs, the 
sequencing-based method increased the resolution by 35-fold and greatly reduced the time needed for 
genotyping, compared to the map generated from 287 PCR-based markers [57]. The power of this 
method was further illustrated in a study using 210 rice RILs to map the GS3 and GW5/qSW5 loci 
related to the grain length and grain width, respectively [62]. 
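
The sliding window idea described above can be sketched as follows. This is a simplified illustration and not the procedure of the cited studies: the window size, the majority-vote rule, and the encoding ("A" and "B" for the two parental alleles, "-" for a missing call) are all assumptions, and heterozygous regions are ignored.

```python
def smooth_genotypes(snp_calls, window=5):
    """Assign each SNP the majority parental allele within a centred window."""
    half = window // 2
    smoothed = []
    for i in range(len(snp_calls)):
        chunk = snp_calls[max(0, i - half): i + half + 1]
        a_count = chunk.count("A")
        b_count = chunk.count("B")
        if a_count > b_count:
            smoothed.append("A")
        elif b_count > a_count:
            smoothed.append("B")
        else:
            # Tie or no data in the window: carry the previous call forward.
            smoothed.append(smoothed[-1] if smoothed else "-")
    return "".join(smoothed)


def breakpoints(smoothed):
    """Indices at which the smoothed parental call changes."""
    return [i for i in range(1, len(smoothed)) if smoothed[i] != smoothed[i - 1]]


# Invented per-SNP calls for one line: two isolated errors and two missing calls.
raw = "AAABA-AAAABBBB-BABBB"
smoothed = smooth_genotypes(raw)
print(smoothed, breakpoints(smoothed))
```

Isolated miscalls and missing data are absorbed by their neighbours, so only the genuine switch between parental blocks is reported as a recombination breakpoint. Consecutive SNPs between breakpoints can then be merged into bins.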

Since missing genotypes in low-depth sequencing would reduce the effectiveness of GWAS, after 
SNPs have been identified by mapping the sequencing reads, the k-nearest neighbor method (KNN) 
that uses in-house algorithms for data-imputation can be adopted in addition to increasing the 
sequencing depth, in order to reduce the missing genotypes [61]. GWAS has been conducted in 
mapping 14 agronomic traits, including drought tolerance, using 373 indica rice lines. One to seven 
loci have been mapped for each trait, and some of them overlap with the previously known loci/genes 
identified through bi-parental QTL mapping or mutant studies [61]. With the great reduction of 
sequencing cost (<US$0.1 per raw megabase in 2012) [65], we anticipate that mapping by sequencing 
will become a popular method to obtain high resolution maps for stress-related loci/genes. 
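
A generic k-nearest neighbor imputation can be sketched as below. It is not the in-house algorithm of the cited study; it only illustrates the principle under stated assumptions: genotypes are coded 0 or 1 for inbred lines, None marks a missing call, distance is the mismatch fraction over jointly observed SNPs, and the toy matrix is invented.

```python
def distance(line_a, line_b):
    """Mismatch fraction over SNPs observed in both lines (1.0 if none shared)."""
    shared = [(a, b) for a, b in zip(line_a, line_b)
              if a is not None and b is not None]
    if not shared:
        return 1.0
    return sum(a != b for a, b in shared) / len(shared)


def knn_impute(genotypes, k=3):
    """Fill each missing call with the majority allele of the k most similar lines."""
    imputed = [row[:] for row in genotypes]
    for i, row in enumerate(genotypes):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Candidate donors: other lines with an observed call at this SNP.
            donors = [(distance(row, other), other[j])
                      for n, other in enumerate(genotypes)
                      if n != i and other[j] is not None]
            donors.sort(key=lambda d: d[0])
            nearest = [allele for _, allele in donors[:k]]
            if nearest:
                # On a tie, the smaller allele code is chosen (arbitrary rule).
                imputed[i][j] = max(sorted(set(nearest)), key=nearest.count)
    return imputed


genotypes = [  # invented lines x SNPs matrix
    [0, 0, 1, 1, None],
    [0, 0, 1, 1, 1],
    [0, 0, 1, None, 1],
    [1, 1, 0, 0, 0],
    [1, None, 0, 0, 0],
    [1, 1, 0, 0, 0],
]
for row in knn_impute(genotypes):
    print(row)
```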

3.5. Genomic Selections 

Genomic selection (GS) is introduced to evaluate the overall effects of all contributing loci 
genome-wide [66]. During the process of GS, a training population will be used for computational 
model training to obtain the genomic estimated breeding values (GEBVs) [67]. Complex traits such as 
drought tolerance are usually determined by multiple small-effect QTLs. GEBV associates markers 
and QTLs by regarding all the markers as variables contributing to the trait and the effect of each 
marker allele towards the complex QTLs is quantified (it can be zero). GEBV determines the sum of 
the marker effects and thus indicates the breeding value of an individual; favorable individuals with 
high GEBVs from breeding populations will be selected for field application. Genotypic and 
phenotypic information of the breeding population can be used to further improve the computational 
model to form a training-breeding cycle [66]. Unlike GWAS and QTL studies, which are designed to 
reduce the breeding time by selecting plants with desired molecular markers at early growth stages 
instead of evaluating the actual phenotypes at a later stage, GEBVs serve only as selection criteria but 
do not lead to target markers or causal genes. 
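
The statement that a GEBV is the sum of marker effects can be written out directly. In the sketch below, the marker effects are assumed to have been estimated already from a training population (for example by one of the statistical models discussed in the next paragraph); the effect values, genotype codes (0, 1, or 2 copies of the scored allele), and line names are invented for illustration.

```python
def gebv(genotype, marker_effects):
    """Genomic estimated breeding value: sum of allele dosage x marker effect."""
    if len(genotype) != len(marker_effects):
        raise ValueError("genotype and marker_effects must have the same length")
    return sum(dosage * effect for dosage, effect in zip(genotype, marker_effects))


# Invented marker effects from a hypothetical training step; an effect may be zero.
marker_effects = [0.5, -0.2, 0.0, 0.3]

# Invented breeding candidates, coded as allele dosages per marker.
candidates = {
    "line_1": [2, 0, 1, 2],
    "line_2": [0, 2, 2, 0],
    "line_3": [1, 1, 0, 2],
}

scores = {name: gebv(g, marker_effects) for name, g in candidates.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

Candidates at the top of the ranking would be the ones advanced, without any individual marker being singled out as causal.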

As high-throughput genotyping and phenotyping have accelerated GS studies by increasing marker 
density and selection capacity, one of the major challenges of GS is selection accuracy. Evaluations of 
GS accuracy have been performed in maize [68], wheat [69,70], barley [71], and cassava [72]. Several 
statistical models for GEBV calculations, including best linear unbiased prediction (BLUP) [67], 
Bayesian shrinkage regression (BayesA, BayesB, etc.) [73], and mixed models have been employed. 
There is no agreement on which model is the most efficient, because many factors such as population 
size and genetic background may affect statistical power [71]. It is believed that GS is a valuable 
approach for plant breeding [74]; however, it will take some time for this concept to develop into a 
practical tool [75]. A GS-based breeding scheme has already been proposed and is considered to be an 
important tool for developing durable stem rust-resistant wheat [76]. 

3.6. Identification of Stress-Related Gene Families 

When properly annotated genomes are available, the genome-wide identification of all members of 
a gene family will become feasible. Since genome duplication (polyploidy or paleopolyploidy) and 
single gene duplication are common in crops [77], genes usually exist in multiple copies and/or in gene 
families. Identifying all members of a gene family may give a more comprehensive view on the 
possible functions of a group of evolutionarily related genes. Bioinformatics tools such as 
Fgenesh [78], GAZE [79], and JIGSAW [80] have been adopted for searching gene families in crops. 

Two typical ways to identify members of gene families from within a genome are keyword search 
and pattern/homology search. Keyword search usually requires precise keywords including gene 
names and controlled vocabularies. The most commonly used controlled vocabularies are Gene 
Ontology, as mentioned in section 2.2, and the functional classification by Pfam, InterPro and 
KEGG [81-83]. 

A genome-wide pattern search usually begins with searching sequence databases using programs 
like BLASTP or TBLASTN [84]. Databases can either be online resources (Table 1) or in-house 
databases. The occurrence of the desired functional domains in the potential sequences can then be 
verified using the Pfam protein families database [81], SMART database [85], or HMMER [86]. When 
the BLAST results are associated with unannotated sequences, these will require further analyses to 
determine the putative gene structures. One example of applying the above strategy to identify 
stress-related genes is the analysis of AP2/EREBPs in the rice genome [87]. "AP2/EREBP" was used 
as the keyword in searching databases, including DRTF, MSU, NCBI, and KOME. Any 
non-redundant sequences obtained were then used as query terms in the TBLASTN and BLASTP 
searches of the MSU and NCBI databases. Four genes with an incomplete AP2 domain were excluded 
after Pfam and SMART analyses because of their very small AP2/ERF domain. A total of 163 genes 
were identified using this method, in contrast to the 139 genes as suggested previously [87]. 
Expression studies revealed that a number of the members are responsive to abiotic or biotic stresses. 
A few of them can even be induced by multiple stresses, suggesting their possible involvement in 
stress responses [87]. 
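
The bookkeeping behind the strategy above (pool keyword and homology hits, remove redundancy, then discard candidates whose domain is incomplete) can be sketched in a few lines. All gene identifiers, domain lengths, and the length threshold below are hypothetical; in practice the domain information would come from tools such as Pfam or SMART, and the cited study's exact exclusion criterion is not reproduced here.

```python
def family_candidates(keyword_hits, homology_hits, domain_lengths, min_domain_length):
    """Merge two hit lists, drop duplicates, and keep genes with a long enough domain."""
    merged = set(keyword_hits) | set(homology_hits)
    return sorted(
        gene for gene in merged
        if domain_lengths.get(gene, 0) >= min_domain_length
    )


# Hypothetical identifiers and values, for illustration only.
keyword_hits = ["gene_01", "gene_02", "gene_03"]
homology_hits = ["gene_02", "gene_04", "gene_05"]
domain_lengths = {          # length of the detected domain, in residues
    "gene_01": 58,
    "gene_02": 60,
    "gene_03": 12,          # truncated domain, will be excluded
    "gene_04": 57,
    "gene_05": 0,           # no domain detected
}

members = family_candidates(keyword_hits, homology_hits, domain_lengths,
                            min_domain_length=40)
print(members)
```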

Supplementary Table S2 summarizes the strategies and tools used in recent literature on 
genome-wide analyses of gene families related to stress responses in major crops. 

4. Functional Genomics 

4.1. Transcriptome 

There are two major technologies for obtaining the overall transcription map of specific plant tissues: 
hybridization-based microarray technology [88] and next generation RNA sequencing technology 
(RNA-seq) [89]. RNA-seq technology, in conjunction with efficient bioinformatics tools, is now more 
widely used to support predicted gene models, extract differentially expressed genes, and find novel 
transcripts in de novo assemblies. Public repositories such as ArrayExpress [90] are designed for the 
storage of expression data. Standard data formats including Minimum Information about Microarray 
Experiments (MIAME) or Minimum Information about Sequencing Experiments (MINSEQE) are 
unified to facilitate transcriptome data submission/downloading. Bioinformatics tools dealing with 
transcriptome alignment, splicing event prediction, and de novo assembly are also available (Table 4). 



Table 4. Widely used bioinformatics tools for the analysis of transcriptome data. 



Software      Description                             Download URL                                                      Reference
ABMapper      RNA-seq data alignment                  http://hkbic.cuhk.edu.hk/software/abmapper                        [91]
Bowtie        RNA-seq data alignment                  http://bowtie-bio.sourceforge.net/bowtie2/index.shtml             [92]
Cufflinks     Transcript assembly                     http://cufflinks.cbcb.umd.edu/                                    [93]
DEGseq        Differential gene expression detection  http://www.bioconductor.org/packages/2.11/bioc/html/DEGseq.html   [94]
Infernal      RNA-seq data alignment                  http://infernal.janelia.org/                                      [95]
Oases         De novo assembly                        www.ebi.ac.uk/~zerbino/oases/                                     [96]
Tophat        RNA-seq data alignment &                http://tophat.cbcb.umd.edu/                                       [97]
              alternative splicing detection
Trans-ABySS   De novo assembly                        http://www.bcgsc.ca/platform/bioinfo/software/                    [98]
Trinity       De novo assembly                        http://trinityrnaseq.sourceforge.net/                             [99]



Crops such as maize [100] and soybean [101] have their own transcriptome atlases, compiled from 
sub-transcriptomes from multiple tissues and different developmental stages. For the transcriptome 
atlas of soybean, plant ontology (PO) was used to describe the developmental stage of each 
experimental tissue, providing a common ground for readers and users to discuss and perform further 
analyses. The cDNA short reads generated by Illumina Genome Analyzer were aligned to the soybean 
reference genome sequence assembly using GSNAP, released in 2005. The digital expression counts 
were determined using the R programming language and normalized using a variation of RPKM 
methods [101]. The global inventory of expressed transcripts of crops under stress is dynamic, both 
temporally and spatially. Time series sampling is a typical experimental design to trace the trajectory 
of such differentially expressed transcripts of crops under stress conditions. A typical example was the 
study of the soybean transcriptome under alkaline stress. Soybean plants were treated with NaHCO3 
and transcriptomes were analyzed using microarray [102]. GO terms were successfully assigned to the 
1380 significantly changed probe sets that are related to metabolism, signal transduction, energy, 
transcription, secondary metabolism, transporter, as well as disease and defense. A time series study 
revealed the interplay of signal transduction and metabolism during the progression of the treatment. 
MapMan tools were used to visualize these changes [102]. Other time series studies include the studies 
of rice root under low potassium [103], cassava under cold stress [104], and soybean subjected to 
Pseudomonas syringae infection [105]. The other widely reported experimental design is the 
comparative transcriptome study performed among crop accessions with different degrees of stress 
tolerance, such as the study of soybean accessions exhibiting differential tolerance toward low 
potassium [106], rice cultivars with contrasting abilities to withstand drought [107] and chilling [108], 
wheat with differential drought tolerance [109], and Medicago [110] and foxtail millet [111] cultivars 
with differential salt tolerance. 
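
The RPKM normalization mentioned above for the soybean atlas can be illustrated in its standard form: reads per kilobase of transcript per million mapped reads. The cited atlas used a variation of this method that is not reproduced here; the gene names, read counts, lengths, and library size below are invented.

```python
def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Standard RPKM: reads per kilobase of transcript per million mapped reads."""
    if gene_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("gene length and library size must be positive")
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)


# Invented counts for two genes in a library of two million mapped reads.
total_mapped_reads = 2_000_000
genes = {
    "gene_A": {"reads": 500, "length_bp": 1000},
    "gene_B": {"reads": 1000, "length_bp": 4000},
}

expression = {
    name: rpkm(g["reads"], g["length_bp"], total_mapped_reads)
    for name, g in genes.items()
}
print(expression)
```

Although gene_B collects twice as many reads, its RPKM is lower because it is four times longer, which is the length bias that the normalization is meant to remove.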

Another strategy to associate transcript abundance to genomic variations is the expression QTL 
(eQTL), which uses differentially expressed transcripts as the quantitative traits [112]. The eQTL maps 
of maize root [113] and rice shoots [114] have identified thousands of cis and trans regulation factors 
by population transcriptome screening. The eQTLs co-localizing with traditional QTL regions could 
give supportive evidence explaining the genetic basis of the targeted phenotypic characters. One 
successful example is the eQTL study of the partial resistance toward Puccinia hordei in barley [115], 
in which some eQTLs were reported to co-localize with previously known rust resistance QTL regions. 
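
The eQTL idea described above can be illustrated with a minimal sketch, assuming that the abundance of one
transcript is regressed on the genotype at one marker (coded as 0, 1 or 2 copies of an allele) and that an
association is called cis when marker and gene lie close together on the same chromosome. The function names,
the toy data and the 1 Mb cis window are illustrative assumptions and are not taken from the cited studies.

```python
from math import sqrt


def linear_fit(genotypes, expression):
    """Return (slope, Pearson r) for expression regressed on genotype."""
    n = len(genotypes)
    mean_g = sum(genotypes) / n
    mean_e = sum(expression) / n
    sgg = sum((g - mean_g) ** 2 for g in genotypes)
    see = sum((e - mean_e) ** 2 for e in expression)
    sge = sum((g - mean_g) * (e - mean_e) for g, e in zip(genotypes, expression))
    if sgg == 0 or see == 0:
        raise ValueError("genotype and expression must both vary")
    return sge / sgg, sge / sqrt(sgg * see)


def classify_eqtl(marker, gene, cis_window=1_000_000):
    """Label an association cis or trans (the window size is an assumption)."""
    same_chromosome = marker["chrom"] == gene["chrom"]
    nearby = abs(marker["pos"] - gene["pos"]) <= cis_window
    return "cis" if same_chromosome and nearby else "trans"


# Hypothetical data: six lines genotyped at one marker, one transcript measured.
genotypes = [0, 0, 1, 1, 2, 2]
expression = [1.0, 1.2, 2.1, 1.9, 3.0, 3.2]
slope, r = linear_fit(genotypes, expression)
label = classify_eqtl({"chrom": "1", "pos": 150_000}, {"chrom": "1", "pos": 400_000})
print(f"slope={slope:.2f} r={r:.3f} type={label}")
```

A genome-wide scan repeats this test for every marker and transcript pair and must correct for multiple testing,
which this sketch leaves out.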

4.2. Proteome 

Due to the alternative splicing of RNA transcripts and post-translational modifications of the 
proteins themselves, the proteome within a cell can be much more complicated than the corresponding 
genome. The gel-based proteomics technology will soon be obsolete due to its limited sensitivity and 
semi-quantitative nature [116]. The rise of the next generation proteomics systems such as Orbitrap 
and QStar, together with the application of isotopic tag-based quantitative proteomics (ICAT [117],
SILAC [118]), isobaric tag-based quantitative proteomics (iTRAQ [119]), and label-free quantitative
proteomics (MaxQuant [120], Serac [121], SIEVE (Thermo Scientific, San Jose, CA, USA)), has
expedited the development of high-throughput proteomic studies. Nevertheless, the pace of adopting
these platforms in plant stress studies is far behind studies in humans. 

Despite the advancement in the proteomics platforms, the application of de novo peptide sequencing 
is still limited. Protein identifications still largely rely on database searches in which experimental 
peptide mass spectra are compared with theoretical peptide mass spectra generated from existing 
sequence databases. Some commonly used databases and useful algorithms are summarized in Table 5. 
Since the genomes of many crops have not been completely sequenced, and some others are still 
unknown, proteins of species without a genome database are frequently identified by referring to 
cross-species databases. In these cases, it is not uncommon that molecular weights and isoelectric 
points (pI) of the identified proteins may deviate from the actual spot position on the 2D gel, despite
the high protein scores. 
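
The database-search principle described above can be sketched as follows, under simplifying assumptions: each
database protein is digested in silico with trypsin rules (cleavage after K or R, not before P, no missed
cleavages), theoretical monoisotopic peptide masses are computed, and a protein is scored by the number of
observed masses it explains. Search engines such as those in Table 5 compare fragment ion spectra and use
statistical scoring, so this is only a peptide-mass illustration. The sequences and the tolerance are hypothetical.

```python
# Monoisotopic residue masses (Da) of the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "I": 113.08406, "N": 114.04293,
    "D": 115.02694, "Q": 128.05858, "K": 128.09496, "E": 129.04259, "M": 131.04049,
    "H": 137.05891, "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # one water is added per free peptide


def tryptic_peptides(protein):
    """Cleave after K or R unless the next residue is P (no missed cleavages)."""
    peptides, start = [], 0
    for i, residue in enumerate(protein):
        following = protein[i + 1] if i + 1 < len(protein) else ""
        if residue in "KR" and following != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])
    return peptides


def peptide_mass(peptide):
    """Theoretical monoisotopic mass of a neutral peptide."""
    return sum(RESIDUE_MASS[residue] for residue in peptide) + WATER


def score_proteins(observed_masses, database, tolerance=0.01):
    """Count, per protein, the observed masses matched by a theoretical peptide."""
    scores = {}
    for name, sequence in database.items():
        theoretical = [peptide_mass(p) for p in tryptic_peptides(sequence)]
        scores[name] = sum(
            any(abs(observed - t) <= tolerance for t in theoretical)
            for observed in observed_masses
        )
    return scores


# Hypothetical two-protein database and two observed peptide masses.
database = {"P1": "AGKLLR", "P2": "WWKYYR"}
observed = [peptide_mass("AGK"), peptide_mass("LLR") + 0.005]
print(score_proteins(observed, database))
```

When a cross-species database is used, a single amino acid substitution shifts the peptide mass outside the
tolerance, which is one reason why identifications in unsequenced crops are less complete.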

Comprehensive reviews summarizing plant proteomic studies from 2006 to 2008 are 
available [122,123]. We have also summarized the plant proteomic studies in 2012 (Supplementary 
Table S3). Recently, plant proteomic investigations have been subdivided into several areas, including 
subcellular proteomics and proteomics-related post-translational modifications. For example, 21 
differentially expressed proteins were identified from salt-treated wheat chloroplasts [124], and 13 and
11 differentially expressed microsomal proteins, respectively, were identified from two distinct
cadmium-accumulating soybean cultivars [125].

Stress-induced posttranslational modifications of proteins are common. They are either the results 
of deleterious damage from the stress, or beneficial modifications to regulate the functions of the 
proteins in order to cope with the stress. To study posttranslational modifications of proteins, special 
techniques within proteomics are used. Redox proteomics requires special labeling methods, including 
the reduction and subsequent labeling of the oxidized thiol groups with 5-iodoacetamidofluorescein
(IAF) [126]. Twenty-two highly oxidized proteins involved in a wide range of biological processes
were identified in ozone-treated rice using this method [127]. Phosphoproteome [128], glycoproteome, 
and secretome [129-132] are sub-categories of proteomics that require special staining and enrichment 
techniques. Post-translational modifications involved in gene expression regulations will be discussed 
in the Epigenomics section below. 



Table 5. Bioinformatics resources commonly used in crop proteomic studies.

Program                                             URL
Mascot                                              http://www.matrixscience.com
SEQUEST                                             http://fields.scripps.edu/sequest/index.html
X! Tandem                                           http://www.thegpm.org/TANDEM/index.html

Database                                            URL
Expasy                                              http://www.expasy.org/
UniProtKB/Swiss-Prot and UniProtKB/TrEMBL database  http://www.uniprot.org/help/uniprotkb
Protein Information Resource (PIR)                  http://pir.georgetown.edu/
RCSB Protein Data Bank (RCSB PDB)                   http://www.rcsb.org/pdb/download/download.do
EMBL-EBI's Protein Data Bank in Europe (PDBe)       http://www.ebi.ac.uk/pdbe/
SWISS-2DPAGE                                        http://world-2dpage.expasy.org/swiss-2dpage/
The Plant Proteome Database (PPDB)                  http://ppdb.tc.cornell.edu/
Plant Protein Phosphorylation DataBase (P3DB)       http://www.p3db.org/
RIKEN Plant Phosphoproteome Database (RIPP)         https://database.riken.jp/sw/links/en/ria102i/
Secretom - The Tomato Fruit Glycoproteome           http://solgenomics.net/secretom/detail/glycoproteome
Plant Secretome KnowledgeBase (PlantSecKB)          http://proteomics.ysu.edu/secretomes/plant.php



4.3. Interactome 

Protein-protein interactions determine the contextual functions of a protein and hence play a crucial 
role in regulation and signal transduction [133]. There are several commonly used experimental 
systems to identify protein-protein interactions, including: (1) yeast two hybrid (Y2H) (reviewed 
in [134]); (2) bimolecular fluorescence complementation (BiFC) (reviewed in [134]); (3) affinity
pull-down coupled with mass spectrometry (AP-MS) (reviewed in [134]); (4) blue native PAGE [135]; 
and (5) structural analysis of protein crystals [136,137]. In addition, literature curation involving 
tedious literature searches can be used to supplement the experimental efforts [138] and in silico 
prediction can be done by searching for orthologous pairs which interact in other systems, to identify 
possible interologues [134,139]. Multiple systems are generally adopted to authenticate the interactions. 

The concept of the plant interactome was initiated years ago, and was based mainly on the 
information collected through literature curation [140]. Subsequently, an experimentally constructed 
interactome map of A. thaliana was established via intensive screening, recording a total of 
6200 high-confidence interactions among 2700 proteins through the screening of proteins encoded by 
8000 open reading frames in the Arabidopsis genome [141]. It is estimated that this screening only 
captured around 2% of the binary protein-protein interactome in A. thaliana [141]. Using the in silico 
interolog prediction method, more than 37,000 interactions among 4567 rice proteins were predicted, 
168 of which have been experimentally confirmed [139]. In this piece of work, the INPARANOID 3.0 
program was used to predict high-confidence protein orthologues in 12 species including rice. With the 
assumption that protein-protein interactions are retained in evolutionarily conserved orthologous
proteins, rice protein-protein interactions were compiled using the predicted orthologous proteins and 
the known interactions in interactome databases [139]. Only a few studies directly related to crop 
stress interactomes have been published (Table 6). In the search for rice stress-related interactomes, four
stress proteins related to disease (XA21 and NH1) and flooding (SUB1A and SUB1C) were used as
baits for the initial interactome screens by Y2H [142]. Preys identified from the initial screens were
then used as baits for subsequent screens. Together with the information from literature curation, an
interactome network consisting of 100 proteins was constructed. The interactomes of the two kinds of
stresses were linked by proteins such as SNRK1A, which has been shown to be related to ABA, a 
positive regulator of abiotic stress responses and a negative regulator of biotic stress responses [142]. 
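
The in silico interolog approach described above can be sketched as follows, assuming that an interaction
known in a reference species is transferred to the target species whenever both partners have predicted
orthologs there. The protein identifiers and the ortholog table are hypothetical; in the cited work the
orthologs were predicted with INPARANOID, which is not reproduced here.

```python
def predict_interologs(known_interactions, orthologs):
    """Transfer interactions from a reference species to a target species.

    known_interactions: iterable of (protein_a, protein_b) pairs in the reference species.
    orthologs: dict mapping each reference protein to a list of target-species orthologs.
    Returns a set of sorted (target_a, target_b) tuples.
    """
    predicted = set()
    for protein_a, protein_b in known_interactions:
        for target_a in orthologs.get(protein_a, ()):
            for target_b in orthologs.get(protein_b, ()):
                predicted.add(tuple(sorted((target_a, target_b))))
    return predicted


# Hypothetical reference interactions and ortholog assignments.
known = [("AtA", "AtB"), ("AtB", "AtC")]
ortholog_table = {"AtA": ["OsA"], "AtB": ["OsB1", "OsB2"]}  # AtC has no ortholog
for pair in sorted(predict_interologs(known, ortholog_table)):
    print(pair)
```

Pairs predicted this way remain hypotheses until they are confirmed experimentally, as was done for a subset
of the predicted rice interactions [139].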

Online resources such as PRIN [143] can help to predict rice interactomes, while BioGRID [144],
DIP [145], PlaPID [146], and IntAct [147] can be queried for previously determined interactomes in
certain plant species. Recent large-scale stress interactome studies in crop plants are shown in Table 6. 



Table 6. Recent large-scale stress interactome studies in crop plants.

Species   Stress                 Strategy                                 Reference
Rice      Abiotic and biotic     Using stress components as bait in Y2H   [142,148]
Wheat     Cold and dehydration   Using stress components as bait in Y2H   [149]
Soybean   SCN infection          In silico prediction                     [150]



4.4. Epigenome 

In addition to the genetic information encoded by DNA, epigenetic modifications of DNA and 
histones provide another dimension of regulation to influence gene expression. Chromatin-associated
proteins, including DNA methylase, histones, and histone-modifying enzymes, are cataloged in the 
ChromDB [151]. Technological platforms for epigenomic research can be considered as an extension 
of genomic and proteomic studies with modifications in analysis protocols. 

For example, cytosine DNA methylation, one of the epigenetic modifications, plays an important 
role in gene silencing and genomic imprinting [152,153]. The transcriptional levels of endogenous 
genes are highly correlated with the methylation status within their promoter or transcribed 
regions [154,155]. One way to detect DNA methylation is to capture and enrich the methylated DNA 
fragments by immunoprecipitation [156]. Bisulfite treatment is another way to distinguish between 
methylated and unmethylated DNA. The bisulfite treatment converts unmethylated (but not methylated) 
cytosines to uracils [157]. Both immunoprecipitation-enriched and bisulfite-treated DNA can be 
analyzed by microarray- or sequencing-based methods to the single-base level of resolution [157-159]. 
A number of bioinformatics tools are designed to handle the bisulfite sequencing data (Table 7). 
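
The logic of bisulfite-based methylation calling can be illustrated with a minimal sketch. It assumes
forward-strand reads that are already aligned to the reference without gaps, and it relies on the fact that
uracil is read as thymine after amplification and sequencing. The function names and the toy sequences are
hypothetical; dedicated tools such as those in Table 7 also handle both strands, mismatches and alignment.

```python
def bisulfite_convert(sequence, methylated_positions):
    """Simulate conversion: unmethylated C is read as T, methylated C stays C."""
    return "".join(
        "T" if base == "C" and i not in methylated_positions else base
        for i, base in enumerate(sequence)
    )


def call_methylation(reference, aligned_reads):
    """Return {position: methylated fraction} for every covered reference cytosine.

    aligned_reads: list of (start, read_sequence) on the forward strand.
    """
    counts = {}
    for start, read in aligned_reads:
        for offset, base in enumerate(read):
            position = start + offset
            if position >= len(reference):
                break
            if reference[position] != "C":
                continue
            methylated, total = counts.get(position, (0, 0))
            if base == "C":      # protected from conversion, so methylated
                counts[position] = (methylated + 1, total + 1)
            elif base == "T":    # converted, so unmethylated
                counts[position] = (methylated, total + 1)
    return {pos: m / t for pos, (m, t) in counts.items() if t > 0}


reference = "ACGTCCGA"
reads = [
    (0, bisulfite_convert(reference, {1})),    # cytosine at position 1 methylated
    (0, bisulfite_convert(reference, set())),  # fully unmethylated molecule
]
print(call_methylation(reference, reads))
```

The per-site fractions summarize a mixture of molecules, which is why sequencing depth matters for the
single-base resolution mentioned above.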

Both biotic and abiotic stresses can lead to massive changes in the DNA methylation status [160-162].
Some stress-induced DNA methylation changes can be inherited by the next generation. The mechanism
for transgenerational DNA methylation may be partially mediated by small RNAs [163]. This
transgenerational DNA methylation has been observed in some crops in response to stress [164,165], as
a way of pre-acquiring immunity toward upcoming stresses via parental priming [164].

Table 7. Bioinformatics tools for the analysis of bisulfite sequencing data.

Tools            Descriptions                                                                                             Reference
BiQ Analyzer HT  Quantitative study of locus-specific DNA methylation patterns from bisulfite sequencing data.            [166]
Bismark          Mapping of bisulfite-sequencing reads and methylation calling.                                           [167]
BRAT-BW          Genome-wide single-base-resolution methylation data analysis.                                            [168]
BSmooth          Estimation of methylation profiles from low-coverage whole-genome bisulfite-sequencing data.             [169]
BS Seeker        Mapping of bisulfite-sequencing reads.                                                                   [170]
BSMAP            Bisulfite read mapping algorithm.                                                                        [171]
CpG_MPs          Analysis of bisulfite-sequencing reads and identification of genome-wide methylation patterns.           [172]
GBSA             Gene-centric or gene-independent analyses of whole-genome bisulfite sequencing data.                     [173]
Kismeth          Analysis of plant bisulfite sequencing results, with a tool for designing bisulfite sequencing primers.  [174]
QUMA             Quantification tool for methylation analysis.                                                            [175]
RRBSMAP          Derivative of BSMAP; a specific tool for reduced-representation bisulfite sequencing.                    [176]



Histone proteins are responsible for the packing of DNA. The epigenetic modifications of core
histones that affect the tightness of DNA packing are called the histone code, which can relay important
information to affect gene expression [177]. The Histone Sequence Database provides a
comprehensive collection of histone sequences and structural information [178]. 

The addition of acetyl groups to histones neutralizes the positive charges and hence loosens the 
condensed DNA, leading to transcriptional activation [179], while the methylation of histones results 
in gene deactivation or repression [180]. The phosphorylation of histones causes the relaxation of 
chromatin and modulates histone acetylation and methylation [181]. 

Individual types of histone modifications on specific amino acid residues can be detected using
specific antibodies or mass spectrometry, while genome-wide histone-DNA associations can
be captured by chromatin immunoprecipitation (ChIP) and subsequently analyzed using either
microarray (ChIP-chip) [182] or sequencing (ChIP-seq) [183].

Some histone-modifying enzymes are induced in crops under stress. For example, a trithorax-like 
H3K4 methyltransferase was found to be induced by drought in drought-tolerant barley cultivars [184] 
while a histone deacetylase was found to be induced by compatible infections and repressed by 
incompatible infections [185]. The methylation statuses of four transcription factors were affected by
salt stress. The expression of three of these transcription factors was also found to be correlated with
their H3 methylation and acetylation statuses [186]. A genome-wide study in rice identified 4837
genes that harbor differential H3K4me3 modification under drought stress, of which the expression of
609 genes was significantly correlated with the H3K4me3 modification [187].

4.5. Phenome 

Every observable biological characteristic beyond the genotype can be regarded as the phenotype. 
Phenotypes can be observed at the molecular, cellular, organismal, and population levels. Phenotypes
also vary throughout the organism's lifecycle, spanning different growth stages, and during different
periods of stress. The environment can also exert significant influences on the phenotype. The total
sum of phenotypes of an organism or a population constitutes the phenome. 

As mentioned in Section 2.2, to make phenotypic data in public databases more searchable and
accessible to users of bioinformatics tools, ontologies are used to describe the setup of the experiment
and the phenotypic data. For example, one may study salt tolerance (TO:0006001) at the whole-plant
flowering stage (PO:0007016) and the days to flower (TO:0000344) of Oryza sativa (GR_tax:013681), 
in a greenhouse study (EO:0007248) under a sodium chloride regimen (EO:0007048). These 
ontologies provide a common language to describe an experiment and render it understandable by both 
researchers and computational algorithms. For instance, some people may record certain phenotypes
during the flowering stage. But what exactly is the "flowering stage"? Some people use the term for
the time when the first flower opens. Others use it for the time when half of the
individual plants have open flowers. The Plant Ontology removes this ambiguity by defining the
flowering stages explicitly. PO:0007026, PO:0007034, PO:0007053 and PO:0007052 refer to the stage at which the first
flower, 1/4 of the flowers, 1/2 of the flowers, and 3/4 of the flowers, open, respectively. PO:0007024 
marks the end of the flowering stage. The application of these ontologies can thus reduce the 
discrepancies in annotating the phenotypes and treatment conditions. 
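
As a minimal sketch of how such controlled vocabularies can be used in practice, the example below stores the
experiment described above as ontology term identifiers and converts a raw flower count into one of the
flowering-stage terms quoted in the text. The record layout, the function names and the rule of assigning the
most advanced milestone reached are assumptions made for illustration, not part of any ontology tool.

```python
import re

# Flowering milestones quoted in the text: (minimum fraction of open flowers, PO term).
FLOWERING_MILESTONES = [
    (0.75, "PO:0007052"),  # 3/4 of the flowers open
    (0.50, "PO:0007053"),  # 1/2 of the flowers open
    (0.25, "PO:0007034"),  # 1/4 of the flowers open
]
FIRST_FLOWER_OPEN = "PO:0007026"
TERM_PATTERN = re.compile(r"^[A-Za-z_]+:\d+$")  # e.g. TO:0006001, GR_tax:013681


def flowering_stage_term(open_flowers, total_flowers):
    """Return the PO term of the most advanced milestone reached, or None."""
    if total_flowers <= 0 or open_flowers <= 0:
        return None
    fraction = open_flowers / total_flowers
    for threshold, term in FLOWERING_MILESTONES:
        if fraction >= threshold:
            return term
    return FIRST_FLOWER_OPEN


def invalid_fields(record):
    """List the fields whose values do not look like ontology identifiers."""
    return [field for field, value in record.items() if not TERM_PATTERN.match(value)]


# The example experiment from the text, expressed as ontology terms.
experiment = {
    "species": "GR_tax:013681",
    "trait": "TO:0006001",
    "growth_stage": "PO:0007016",
    "measurement": "TO:0000344",
    "environment": "EO:0007248",
    "treatment": "EO:0007048",
}
print(invalid_fields(experiment), flowering_stage_term(20, 40))
```

A free-text entry such as "flowering" would be flagged by invalid_fields, which is the kind of discrepancy the
ontologies are meant to prevent.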

High-quality phenotypic information is essential for mapping, association studies, gene 
identifications, gene functional studies and genomic selections. To design experiments to collect 
phenotypic information, some critical parameters have to be considered, such as the sample/population 
sizes, experimental conditions, phenotypes to be assessed, and the data acquisition methods. 



Table 8. Databases for mutant and germplasm resources.

Species        Databases                                 URL
Barley         Barley DB                                 http://www.shigen.nig.ac.jp/barley/
Barley         NordGen Plant Collection                  http://www.nordgen.org
Maize          RescueMu Maize Mutant Phenotype Database  http://maizegdb.org/rescuemu-phenotype.php
Rice           Oryza Tag Line                            http://oryzatagline.cirad.fr/
Rice           Rice Tos17 Insertion Mutant Database      http://pfg101.nias.affrc.go.jp/
Soybean        SoyBase - Fast Neutron Mutants            http://www.soybase.org/mutants/index.php
Tomato         Genes that Make Tomatoes                  http://zamir.sgn.cornell.edu/mutants/index.html
Tomato         LycoTILL                                  http://www.agrobios.it/tilling/index.html
Tomato         TOMATOMA                                  http://tomatoma.nbrp.jp/index.jsp
Tomato/Pea     URGV TILLING Database                     http://urgv.evry.inra.fr/UTILLdb
Tomato/Potato  SOL genomics network                      http://solgenomics.net/
Wheat          The Scottish Wheat Variety Database       http://wheat.agricrops.org/menu.php



The size of a population can vary from a few plants for functional studies, several hundred lines for 
mapping and GS, to as many as a thousand germplasms for GWAS. Some public collections of 
germplasms or populations are available for public requests. The United States Department of 
Agriculture National Plant Germplasm System has a collection of over 500,000 germplasm accessions 
from 10,000 plant species including rice, soybean, tomato and many other staple crops. Table 8
summarizes some publicly available mutant and germplasm collections, some of which also provide
phenotypic descriptions and photos of the mutants.

Since the phenome is the overall outcome of the interactions between the genotype and the
environment, whether the phenotypic data are collected in a controlled environment or not can greatly 
affect the final interpretation of results. Field experiments can better mimic the actual conditions of 
crop production, but the consistency of the phenotype greatly depends on the location of the field, the 
soil composition, weather conditions, season, and so on. The interpretation of results can thus be 
complicated. For example, a change in the transpiration rate in some of the plants may not solely be 
the result of the stress treatment, but also the result of localized changes in light intensity and/or 
temperature in the field [188]. In general, a larger number of replications is required to compensate
for the effects of environmental variations. A controlled environment such as that in a greenhouse
or a growth chamber can minimize the effects of environmental fluctuations and hence will emphasize 
the contribution of the genotype. However, data from controlled experiments are usually limited in 
scale and may overlook the fast-changing environment in the real production field. 

Choosing the appropriate phenotypes to be assessed is also important. For example, stomatal 
conductance and pathogen titer are good indicators of osmotic stress tolerance and disease resistance, 
respectively. However, they are not readily applicable in large-scale experiments because of instrument
limitations and laborious procedures. On the other hand, fresh weight and biomass directly
reflect the productivity of the crops, but taking these measurements is destructive to the plant. For
morphological and physiological phenotyping of crops under stress, a conversion of stress symptoms 
to parameters that can be captured and digitized is needed for high-throughput automation. Commonly 
used methods include: 2D or 3D visible light imaging [189,190], infrared thermography [188], 
near-infrared imaging, spectral reflectance [191], fluorescence analysis [191,192], stable isotope 
analysis [193] and X-ray imaging [194]. For example, a study of wheat salt stress response suggested
that the shoot area calculated from three digital images (two side views and one top view) showed a strong positive
correlation with manually measured leaf area and shoot fresh weight, which commonly serve as
indicators of salt tolerance in crops [195]. As a non-destructive method, the imaging system could
continuously monitor the growth of the plant and distinguish its bi-phasic (osmotic stress phase and 
ionic stress phase) growth under salinity stress [195]. Another example is related to osmotic stresses 
(salinity and drought) that reduce stomatal conductance. Since the reduction in stomatal conductance 
will halt the cooling effect of transpiration, infrared thermal imaging can be used to monitor the degree 
of salinity and drought stress [188,196]. In the case of lesions on leaf surfaces caused by plant 
diseases, instead of measuring the lesion area on each leaf, determining the reduction in chlorophyll 
fluorescence is a possible alternative [197]. 
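
The image-based approach described above can be sketched in a few lines, assuming that each image has already
been segmented into a binary mask (1 for plant pixels, 0 for background). The projected shoot area is taken as
the sum of plant pixels over two side views and one top view, and growth between two time points is summarized
with the standard relative growth rate, (ln A2 - ln A1) / (t2 - t1). The masks and function names are
hypothetical, and the segmentation step itself is not shown.

```python
from math import log


def projected_area(mask):
    """Number of plant pixels in one binary mask (list of rows of 0/1)."""
    return sum(sum(row) for row in mask)


def shoot_area(side_view_a, side_view_b, top_view):
    """Projected shoot area summed over two side images and one top image."""
    return sum(projected_area(m) for m in (side_view_a, side_view_b, top_view))


def relative_growth_rate(area_start, area_end, days):
    """Relative growth rate per day between two non-destructive measurements."""
    if area_start <= 0 or area_end <= 0 or days <= 0:
        raise ValueError("areas and interval must be positive")
    return (log(area_end) - log(area_start)) / days


# Hypothetical 3 x 3 masks of one plant imaged on two days.
day0 = ([[0, 1, 0], [0, 1, 0], [0, 1, 0]],
        [[0, 1, 0], [0, 1, 0], [0, 0, 0]],
        [[0, 0, 0], [0, 1, 0], [0, 0, 0]])
day3 = ([[1, 1, 0], [0, 1, 1], [0, 1, 0]],
        [[0, 1, 1], [1, 1, 0], [0, 1, 0]],
        [[0, 1, 0], [1, 1, 1], [0, 1, 0]])
a0, a3 = shoot_area(*day0), shoot_area(*day3)
print(a0, a3, round(relative_growth_rate(a0, a3, 3), 3))
```

Tracking this rate over repeated imaging sessions is one way to follow the growth of the same plant through
successive phases of a stress treatment without harvesting it.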

In addition to the physiological phenotypes, metabolite profiles in crops are also altered by both 
biotic and abiotic stresses [198,199]. Deleterious metabolites such as reactive oxygen species might be 
generated through the disruption of normal cellular processes while beneficial metabolites such as 
signaling molecules and osmoprotectants may be generated to alleviate the stress [200,201]. 

Current platforms for metabolomic analyses include various forms of liquid
chromatography-coupled mass spectrometry (LC-MS), gas chromatography-coupled MS (GC-MS),
capillary electrophoresis-coupled MS (CE-MS), Fourier transform MS (FT-MS), Fourier transform
infrared spectrometry (FT-IR), and one- or two-dimensional nuclear magnetic resonance (NMR). Raw
spectra generated by mass spectrometry or NMR can be analyzed by making references to either 
in-house or online databases (Table 9). Biological multivariate data generated from metabolomic 
studies are commonly analyzed using principal component analysis (PCA) [202], partial least squares
(PLS) [203], and orthogonal projections to latent structures (O-PLS) [203,204]. 
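
As an illustration of the first of these methods, the sketch below computes the first principal component of a
small samples-by-metabolites table using only the standard library: the columns are mean-centered, the
covariance matrix is formed, and its leading eigenvector is found by power iteration. This is a teaching
sketch under stated assumptions (a non-constant data table, and a starting vector that is not orthogonal to
the leading component); real analyses normally use an established statistics package, and the data here are
made up.

```python
from math import sqrt


def center_columns(data):
    """Subtract the column mean from every value."""
    n, p = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(p)]
    return [[row[j] - means[j] for j in range(p)] for row in data]


def covariance_matrix(centered):
    """Sample covariance matrix of mean-centered data."""
    n, p = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(p)]
            for i in range(p)]


def leading_eigenvector(matrix, iterations=200):
    """Power iteration: returns (unit eigenvector, eigenvalue)."""
    p = len(matrix)
    v = [1.0 / sqrt(p)] * p
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = sqrt(sum(x * x for x in w))
        if norm == 0:
            break
        v = [x / norm for x in w]
    mv = [sum(matrix[i][j] * v[j] for j in range(p)) for i in range(p)]
    return v, sum(a * b for a, b in zip(v, mv))


def first_principal_component(data):
    """Return (sample scores, loadings, fraction of variance explained) for PC1."""
    centered = center_columns(data)
    cov = covariance_matrix(centered)
    loadings, eigenvalue = leading_eigenvector(cov)
    total_variance = sum(cov[i][i] for i in range(len(cov)))
    scores = [sum(x * w for x, w in zip(row, loadings)) for row in centered]
    return scores, loadings, eigenvalue / total_variance


# Made-up abundances of two metabolites in four samples.
table = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
scores, loadings, explained = first_principal_component(table)
print([round(s, 3) for s in scores], [round(w, 3) for w in loadings], round(explained, 3))
```

Plotting the scores of the first two or three components is the usual way to see whether stressed and control
samples separate.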



Table 9. Mass spectra databases and other bioinformatics resources for metabolomic studies.

Database                                          URL                                             Reference
AOCS Lipid Library                                http://lipidlibrary.aocs.org/index.html         -
Golm Metabolome Database                          http://gmd.mpimp-golm.mpg.de/                   [205]
Lipidomics Gateway                                http://www.lipidmaps.org/                       [206]
Madison-Qingdao Metabolomics Consortium Database  http://mmcd.nmrfam.wisc.edu/                    [207]
Manchester Metabolomics Database                  http://dbkgroup.org/MMD/                        [208]
MassBank                                          http://www.massbank.jp/index.html               [209]
Metabolome Express                                https://www.metabolome-express.org/             [210]
METLIN                                            http://metlin.scripps.edu/                      [211]
NIST Chemistry WebBook                            http://webbook.nist.gov/chemistry/#Notes        -

Software/Tools                                    URL                                             Reference
AMDIS                                             http://www.amdis.net/                           -
COLMAR                                            http://spinportal.magnet.fsu.edu/               [212]
MetDat                                            http://smbl.nus.edu.sg/METDAT2/                 [213]
MetaboSearch                                      http://omics.georgetown.edu/MetaboSearch.html   [214]



Numerous metabolomic studies have been done on crops under stress, including: salinity [215-217], 
drought [218-221], flooding [222], ozone treatment [223], fungal infections [224-226], bacterial 
infections [217,227], other infections [228], and multiple stresses [229,230]. 

There are two major research strategies in metabolomic studies: metabolic fingerprinting and 
metabolic profiling. Metabolic fingerprinting uses the mass-to-charge ratio of mass spectrometry, the 
peak height and/or retention time of chromatography and the strength of NMR signal as the metabolomic 
signature in specific samples, such that the identity of each metabolite is not necessary [231]. This 
helps to classify different samples into categories. For example, metabolic fingerprints have been made 
to differentiate between disease-resistant and susceptible varieties [217,227,228] or between 
salt-tolerant and sensitive varieties [232]. In one study, Fourier transform infrared (FT-IR)
spectroscopy was used for the metabolic fingerprinting of salt-treated tomatoes [233]. A total of 882
FT-IR spectral variables were collected between wavenumbers 4000 and 600 cm⁻¹ for each
sample [233]. Through discriminant function analysis (DFA) of the spectra variables, without knowing 
the identity and the quantity of each metabolite, salt-treated and control samples can be discriminated. 
Furthermore, key regions within the spectrum distinguishing the treated from the untreated samples 
were identified through genetic algorithms, and the major components were found to be amino radicals 
and nitrile-containing compounds [233]. Thus, disease resistance and stress tolerance of novel crop 
varieties can be assessed by comparing their metabolic fingerprints with those of well characterized 
varieties, facilitating the screening process. 
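
The principle of fingerprinting, classifying samples from the whole signal pattern without identifying any
metabolite, can be sketched with a deliberately simple classifier. The cited study used discriminant function
analysis; the example below swaps in a nearest-centroid rule, which is not the authors' method, and uses made-up
five-point "spectra" in place of the 882 FT-IR variables.

```python
from math import sqrt


def mean_spectrum(spectra):
    """Average the spectra of one class, variable by variable."""
    n = len(spectra)
    return [sum(values) / n for values in zip(*spectra)]


def euclidean(a, b):
    """Euclidean distance between two spectra of equal length."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def classify(spectrum, training):
    """Assign a spectrum to the class with the nearest mean spectrum.

    training: dict mapping a class label to a list of reference spectra.
    """
    centroids = {label: mean_spectrum(spectra) for label, spectra in training.items()}
    return min(centroids, key=lambda label: euclidean(spectrum, centroids[label]))


# Made-up fingerprints of well-characterized control and salt-treated samples.
training = {
    "control": [[0.9, 0.2, 0.1, 0.5, 0.3], [1.0, 0.3, 0.1, 0.4, 0.3]],
    "salt-treated": [[0.4, 0.8, 0.6, 0.5, 0.1], [0.5, 0.9, 0.7, 0.4, 0.2]],
}
unknown = [0.45, 0.85, 0.6, 0.45, 0.15]
print(classify(unknown, training))
```

No variable in this example is ever linked to a compound name, which is the defining feature of the
fingerprinting strategy.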

On the other hand, metabolic profiling compares the metabolic compositions between samples and 
hence the quantitation and identification of the metabolites are required. Signal patterns must be matched 
to known standards or depositions in the databases in order to identify the actual compounds. For 
example, the accumulation of compatible solutes, such as proline, glycine-betaine, and their precursors, 
is usually observed in osmotically stressed crops, especially in tolerant varieties [216,220,221].
A specific example is the mitochondrial metabolic profile of flood-stressed soybean; metabolites were 
extracted from the roots and hypocotyls of soybean seedlings with or without submergence 
stress [226], and were then analyzed using capillary electrophoresis mass spectrometry. Eighty-one 
mitochondria-related metabolites were identified and quantified with reference to the commercially 
available standards [226]. There was an accumulation of TCA cycle-related metabolites, including 
citrate, succinate, and aconitate, but a reduction in ATP in flood-stressed plants, which can be 
explained by the arrest of aerobic respiration due to anoxia [222]. Following a similar logic, the
accumulation of antimicrobial compounds, such as caffeic acid, phytoalexins, glycoalkaloids, and
other polyphenolic compounds, is common in pathogen-infected crops compared to their uninfected
counterparts [224-226,228]. Glucose oxidase secreted by a fungal pathogen, Botrytis cinerea, can also 
lead to the accumulation of gluconic acid in Vitis vinifera cv. Chardonnay berries [226]. 
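
The identification step in metabolic profiling, matching a measured signal to a known compound, can be
sketched as a mass lookup. The example assumes positive-mode ions of the form [M+H]+, computes monoisotopic
masses from molecular formulas, and accepts a match within a parts-per-million tolerance. The small library,
the 10 ppm tolerance and the function names are illustrative assumptions; confident identification normally
also requires retention or migration time and comparison with authentic standards, as in the study above.

```python
import re

# Monoisotopic masses (Da) of the elements used in this small example.
ELEMENT_MASS = {"C": 12.0, "H": 1.00782503, "N": 14.0030740, "O": 15.9949146}
PROTON = 1.00727646

# Hypothetical mini-library of metabolites mentioned in the text.
LIBRARY = {
    "proline": "C5H9NO2",
    "glycine betaine": "C5H11NO2",
    "succinic acid": "C4H6O4",
    "citric acid": "C6H8O7",
}


def monoisotopic_mass(formula):
    """Monoisotopic mass of a neutral molecule from a simple formula such as C5H9NO2."""
    total = 0.0
    for element, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        total += ELEMENT_MASS[element] * (int(count) if count else 1)
    return total


def identify(observed_mz, library=LIBRARY, ppm=10.0):
    """Return library compounds whose [M+H]+ m/z lies within the ppm tolerance."""
    hits = []
    for name, formula in library.items():
        expected = monoisotopic_mass(formula) + PROTON
        if abs(observed_mz - expected) / expected * 1e6 <= ppm:
            hits.append(name)
    return hits


for mz in (116.0706, 118.0863, 150.0):
    print(mz, identify(mz))
```

Once compounds are identified, their quantities can be compared between stressed and control samples, which is
the comparison that profiling studies report.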

5. Future Perspectives 

Sequencing throughput is no longer the major limiting factor in genomics and transcriptomics 
studies. The next generation sequencing platforms can actually generate enough depth for genome 
assembly in one or several runs [234]. However, sequence assembly and annotation for complex 
genomes remain challenging. The data acquisition platforms for other "-omics", on the other hand, are 
under rapid development to catch up with the pace of genomic research. While the data source is no 
longer a rate-determining step, data integration and interpretation have become the bottleneck in the 
research pipeline. One obstacle hindering the cross-platform analyses of different datasets is the 
variations in experimental designs, treatment conditions, and data formats. Drawing meaningful 
conclusions may sometimes be difficult when there are discrepancies between two germplasms. For 
example, the transcriptomic data of one germplasm may not be used effectively to explain the 
proteomic data of another germplasm. Researchers should therefore strategically design experiments to 
generate interrelated -omics data using carefully selected germplasms. The standardization of data 
acquisition and storage formats using strictly controlled vocabulary is also important. 

With the advance of computer technology and high-throughput analysis platforms, life processes 
can now be captured, digitized, and stored in the hard disk of a computer. Yet, no matter how perfectly 
a genome is sequenced and assembled, biological data from experiments are still essential to connect 
the genotypes and the phenotypes. A few software platforms have been developed to integrate the
interactions of cellular components into networks [235,236]. For example, VirtualPlant has been
developed as a software platform for the integration and analysis of different levels of data [237]. 
It provides large datasets of Arabidopsis gene annotation, gene functional categories, microarray data, 
biochemical pathways, interaction information, and microRNA:mRNA interaction information. Users 
can also upload their own gene lists and microarray data for analysis, and identify coexpressed genes, 
interacting proteins, and metabolites associated with their genes of interest. Building a similar platform 
for crop plants could be extremely useful but it requires a well-coordinated effort among different 
research centers and groups. 